Master Advanced Error-Handling to Make PySpark Pipelines Production-Ready
Building GitOps Pipelines With Helm on OpenShift: Lessons From the Trenches
Kubernetes in the Enterprise
Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams. As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines. DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first production cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.
Getting Started With CI/CD Pipeline Security
Java Caching Essentials
Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below. Database Systems Research With databases powering nearly every modern application, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing? We're diving into key topics such as: the databases and query languages developers rely on; experiences and challenges with cloud migration; practices and tools for data security and observability; data processing architectures and the role of real-time analytics; and emerging approaches like vector and AI-assisted databases. Join the Database Systems Research Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help! —The DZone Content and Community team
If you've attempted to build a dashboard, then you're familiar with the hassle of polling. You hit your API every couple of seconds, grab updates, and pray your data doesn't feel stale. But if we're being honest, polling is inefficient, wasteful, and antiquated. Users now expect data to be live and flowing, and we, as developers, should meet that expectation without melting our servers. In this post, I will walk you through a serverless, event-driven architecture that I've used to build real-time dashboards on AWS. It ties together EventBridge, OpenSearch, and API Gateway WebSockets, with a hint of Lambda and DynamoDB. By the end, you'll understand how the pieces fit together into a live dashboard data pipeline that scales, stays cost-friendly, and actually feels fast for the end user. Let's get started!

Why Not Just Poll? Traditional dashboards query some database every few seconds. While this is fairly simple, it has major disadvantages: Latency: the data always feels a step behind. Cost: every poll fires excessive, unnecessary queries at your backend. User experience: users get frustrated staring at stale charts that don't feel dynamic. Instead of forcing the UI to constantly ask, "Are we there yet?", we flip the model: events are pushed to the dashboard as they happen. Now consider the AWS trio: Amazon EventBridge – the backbone of the architecture, capturing domain events. Amazon OpenSearch Service – provides fast indexing and querying of those events. Amazon API Gateway (WebSocket) + Lambda + DynamoDB – provides a live communication layer that pushes updates to each client in real time. Sounds good so far? Let's review how the architecture works together.

Architecture Overview Here is the end-to-end flow of how events move through the architecture: A downstream service emits an event → EventBridge captures it. EventBridge routes the event to an Indexing Lambda, which normalizes the event and stores it in OpenSearch. Once indexing succeeds, the Lambda publishes a 'delta' event back into EventBridge. That delta event triggers the Broadcast Lambda, which looks up active WebSocket connections in DynamoDB. The Broadcast Lambda pushes updates to clients over API Gateway WebSockets. Clients render changes immediately — no refreshing, no polling. An illustration (very ASCII) of the flow: Plain Text Service → EventBridge → Indexing Lambda → OpenSearch ↓ EventBridge (delta) ↓ Broadcast Lambda ↓ API Gateway (WebSocket) → Clients (UI)

Step 1: Indexing Data to OpenSearch Before pushing events to dashboards, it is always best to have an indexing strategy. Here is one example of an index template: JSON { "index_patterns": ["metrics-*"], "template": { "settings": { "number_of_shards": 1, "number_of_replicas": 1 }, "mappings": { "dynamic": "false", "properties": { "@timestamp": { "type": "date" }, "service": { "type": "keyword" }, "eventType": { "type": "keyword" }, "latencyMs": { "type": "long" }, "message": { "type": "text", "fields": { "raw": { "type": "keyword" } } } } } } } A few best practices I learned the hard way: Use stable document IDs (eventId) to avoid duplicates. Use Index State Management (ISM) to roll indices over daily. Use ISM to auto-expire data after 14 days (or however long you need). Why? Because dashboards are not data lakes: you should be able to query the data you want without paging through potentially large indexes.
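To make the ISM suggestion concrete, here is a minimal sketch of an ISM policy that rolls the metrics-* indices over daily and deletes them after 14 days. The description, ages, and priority are illustrative assumptions; adjust them to your own retention needs.

JSON
{
  "policy": {
    "description": "Roll metrics indices over daily, delete after 14 days",
    "default_state": "hot",
    "ism_template": { "index_patterns": ["metrics-*"], "priority": 100 },
    "states": [
      {
        "name": "hot",
        "actions": [{ "rollover": { "min_index_age": "1d" } }],
        "transitions": [{ "state_name": "delete", "conditions": { "min_index_age": "14d" } }]
      },
      {
        "name": "delete",
        "actions": [{ "delete": {} }],
        "transitions": []
      }
    ]
  }
}

Note that the rollover action assumes you write through an alias with a rollover alias configured on the index; if you prefer, you can skip rollover and simply delete whole daily indices once they age out.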
Step 2: Infrastructure for the WebSocket API First, you need a serverless WebSocket API, which we will create with API Gateway. There are three important routes to care about: $connect – saves the connectionId in DynamoDB. $disconnect – removes it when the client disconnects. $default – handles optional messages from the client (e.g., when a user subscribes to a channel). Here is an example of the DynamoDB table setup (via SAM/CDK): YAML Resources: ConnTable: Type: AWS::DynamoDB::Table Properties: BillingMode: PAY_PER_REQUEST AttributeDefinitions: - AttributeName: connectionId AttributeType: S KeySchema: - AttributeName: connectionId KeyType: HASH For each active client, we put a row in this table. That is how the Broadcast Lambda knows who is online.

Step 3: Broadcasting Events Once a new event is indexed, we are off to the races: we send it out to all the active clients. Here is a sample Lambda function: JavaScript import { DynamoDBClient, ScanCommand } from "@aws-sdk/client-dynamodb"; import { ApiGatewayManagementApiClient, PostToConnectionCommand } from "@aws-sdk/client-apigatewaymanagementapi"; const ddb = new DynamoDBClient(); const api = new ApiGatewayManagementApiClient({ endpoint: process.env.WS_ENDPOINT }); const TABLE = process.env.TABLE; export const broadcast = async (event) => { const payload = event.detail || event; const conns = await ddb.send(new ScanCommand({ TableName: TABLE })); const targets = conns.Items.map(i => i.connectionId.S); const msg = JSON.stringify({ type: "hits", hits: payload.hits || [payload.doc] }); await Promise.allSettled( targets.map(id => api.send( new PostToConnectionCommand({ ConnectionId: id, Data: Buffer.from(msg) }) ) ) ); return { sent: targets.length }; }; As you can see, the function: scans DynamoDB to get the active connections; serializes the payload into a JSON message; and uses Promise.allSettled so that if one connection breaks, the rest of the batch still goes through.

Step 4: Client Side Let’s keep things unbelievably simple here: HTML <script> const ws = new WebSocket('wss://<api-id>.execute-api.<region>.amazonaws.com/prod'); ws.onmessage = (evt) => { const msg = JSON.parse(evt.data); if (msg.type === 'hits') { console.log("Live update:", msg.hits); // Update your chart or table here } }; </script> That's all it takes to bring the data to life: no cron jobs, no reloads, just instant feedback.

Lessons Learned (The Hard Way) Here are some key points I wish someone had told me sooner. Idempotency – always use stable IDs for your documents to avoid duplicates. Do not spam everyone – use channel attributes in DynamoDB so you can push events to just the right clients. There are size limits – each WebSocket frame is capped at 32 KB; if you go over, batch or sample your payload. Do not forget about costs – do your filtering in the OpenSearch query, and only push deltas through WebSockets.

Security First Attach a Lambda authorizer to the $connect route to validate JWTs or API keys. And think about it: do you really want every single user to see every event? Or should events be filtered by team, service, or location?
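As a companion to the security note above, here is a minimal sketch of a REQUEST-type Lambda authorizer for the $connect route. The token query-string parameter and the simple shared-secret comparison are illustrative assumptions; in production you would verify a real JWT instead.

JavaScript
// Minimal $connect authorizer: allow the connection only when a valid token is supplied.
export const authorize = async (event) => {
  const token = event.queryStringParameters?.token;                      // assumed parameter name
  const isAllowed = Boolean(token) && token === process.env.SHARED_TOKEN; // illustrative check

  return {
    principalId: isAllowed ? "dashboard-user" : "anonymous",
    policyDocument: {
      Version: "2012-10-17",
      Statement: [
        {
          Action: "execute-api:Invoke",
          Effect: isAllowed ? "Allow" : "Deny",
          Resource: event.methodArn, // the $connect route being authorized
        },
      ],
    },
    // Optional: pass per-user attributes (team, service) downstream for filtered broadcasts.
    context: isAllowed ? { user: "dashboard-user" } : {},
  };
};

Returning a Deny policy rather than throwing gives clients a clean 403, and the context block is a handy place to stash the attributes you will later use to decide who receives which events.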
Frequently Asked Questions Q: Can this work without OpenSearch? Yes, you could push raw events directly to clients, but dashboard interfaces generally need querying, filtering, and analytics; that is what OpenSearch provides. Q: How many WebSocket clients can API Gateway handle? Tens of thousands. For very high-scale scenarios, shard your connections and distribute them across multiple WebSocket API endpoints. Q: What about retries for a failed push? Use DLQs (dead-letter queues) or retries in your Broadcast Lambda, and promptly remove any failed connections from DynamoDB. Q: Is this overkill for small apps? Honestly, yes. If you are building a hobby project, polling every five seconds is fine. Once you need to serve real-time metrics to users on dashboards, it becomes worth it.

Conclusion With EventBridge, OpenSearch, and WebSockets, we can get rid of polling and build dashboards that feel alive in real time. On top of that, the stack is completely serverless and scales with your traffic without you having to babysit anything. The next time someone asks you to stream metrics, logs, or KPIs into a dashboard, consider this pipeline instead of cron jobs. So what do you think: would you replace the polling in your dashboards with this approach? Or are there challenges I did not mention? Either way, I would love to hear how you might adapt this to your own projects.
Caching has become crucial for today's performance-demanding applications as a way to decrease latency, improve responsiveness, and lessen database load. Developers can implement successful caching strategies using Redis or AWS ElastiCache in conjunction with Spring Boot's elegant caching abstraction. Low-latency responses, high throughput, cost-effectiveness, and scalability matter more and more for modern applications, and effective caching can speed up responses for recently requested data by 10–100 times while reducing database load by 70–90%.

Patterns of Caching

Cache-Aside (Lazy Loading) Pattern The most common caching pattern is cache-aside, also known as lazy loading, in which the application code loads data into the cache. When data is requested, the application checks the cache first. If the data is present (cache hit), it is returned instantly. If it is not (cache miss), the data is fetched from the primary data store and cached for subsequent requests. When to apply: read-heavy workloads with unpredictable access patterns; applications that must keep working when the cache fails; when you want precise control over which data is cached. Advantages: easy to understand and apply; application availability is unaffected by a cache failure; only requested data is cached, so no extraneous data is stored. Drawbacks: a latency penalty on cache misses and at startup; stale data may linger if invalidation is not managed properly; cache management logic lives in the application code.

Write-Through Pattern The write-through pattern keeps the cache consistently fresh by writing to the cache and the primary data store together: every write is applied to both systems before a successful acknowledgment is returned. When to apply: applications that need consistent data; write-heavy workloads where newly written data must be readable immediately. Advantages: database and cache stay consistent, minimizing cache misses; a simpler read path. Drawbacks: increased write latency (two storage operations); more intricate error handling; rarely read data can pollute the cache.

Write-Behind (Write-Back) Pattern The write-behind pattern updates the cache immediately and updates the primary data store asynchronously, which can offer exceptional write performance. The database update can be deferred to quieter periods, which optimizes performance but adds complexity around data durability. When to apply: write-heavy workloads; applications that can tolerate eventual consistency; when write performance is crucial. Advantages: outstanding write performance and decreased database load; batched writes are easy to achieve. Drawbacks: risk of data loss if the cache fails before the data is persisted.

The Caching Architecture of Spring Boot Spring Boot's caching abstraction offers a single API that works with several caching providers. The abstraction uses aspect-oriented programming (AOP) to intercept service calls and apply caching logic transparently. Annotation-based approach: declarative caching uses annotations to describe caching behavior. @Cacheable: populates the cache on a miss and returns the cached value on a hit. @CachePut: updates the cache entry even if one already exists. @CacheEvict: removes a cache entry. @Caching: combines multiple cache operations. @CacheConfig: provides a shared base cache configuration at the class level.
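Here is a minimal sketch of the annotation-based approach on a hypothetical ProductService. The cache name, SpEL key expressions, Product type, and ProductRepository are illustrative assumptions; the backing cache (Redis, ElastiCache, or a local provider) is whatever CacheManager you configure, and @EnableCaching must be present on a configuration class.

Java
import org.springframework.cache.annotation.CacheConfig;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.CachePut;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
@CacheConfig(cacheNames = "products")            // shared cache name for this class
public class ProductService {

    private final ProductRepository repository;  // hypothetical Spring Data repository

    public ProductService(ProductRepository repository) {
        this.repository = repository;
    }

    // Cache-aside read: populated on the first miss, served from the cache afterwards.
    @Cacheable(key = "#id")
    public Product findById(long id) {
        return repository.findById(id).orElseThrow();
    }

    // Keeps the cached entry in sync on every save.
    @CachePut(key = "#product.id")
    public Product save(Product product) {
        return repository.save(product);
    }

    // Invalidation: removes the entry so the next read reloads fresh data.
    @CacheEvict(key = "#id")
    public void delete(long id) {
        repository.deleteById(id);
    }
}

Switching providers (say, a local cache in development and Redis in production) requires no change to this class; only the CacheManager configuration changes.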
Generation of Cache Keys The default strategy uses all method parameters; custom keys can be built with SpEL expressions; key generators handle complex scenarios; and object serialization deserves attention when keys or values are non-trivial.

Redis Integration Strategies Single-instance Redis: sufficient for development or small applications; a single point of failure; simpler configuration and monitoring; limited scalability. Redis cluster mode: scales horizontally across multiple nodes; automatic failover for high availability; data partitioned across the cluster nodes; more complicated installation and monitoring.

Data Serialization Considerations The serialization choice has performance implications. JSON serialization: human-readable; usable across languages; takes up more space; slower to serialize and deserialize. Binary serialization: compact and space-efficient; faster parsing; not human-readable; possible data version compatibility problems.

AWS ElastiCache Integration Benefits of ElastiCache for Redis AWS ElastiCache offers managed Redis infrastructure with: a managed service (automatic backups, patching, monitoring); high availability (Multi-AZ deployments with automatic failover); scalability (vertical and horizontal); security (VPC integration, encryption at rest and in transit); and performance (optimized I/O and memory).

Cluster Configuration Strategies Replication groups: primary-replica architecture for read scaling; automatic failover to replica nodes; cross-AZ replication for disaster recovery. Cluster mode: data sharded across multiple shards; up to 500 nodes per cluster; automatic resharding.

Advanced Caching Techniques Multi-Level Caching Multi-level caching combines local and distributed caching for maximum efficiency. There are two levels to this strategy: Level 1 (L1), the local cache: an in-memory cache within each application instance; ultra-fast access (nanoseconds); limited by JVM heap size; no network overhead. Level 2 (L2), the distributed cache: shared across application instances; capacity beyond the JVM heap limit; subject to network latency on access; consistent across every instance of the application.

Cache Warming Strategies Cache warming preloads the cache with the most frequently accessed data, sparing users the cost of initial cache misses. Proactive warming saves time and significantly improves perceived performance. Application startup warming: warm the cache during application initialization; use application-specific analytics to identify frequently accessed data; warm either all at once or gradually to avoid overloading the system at startup. Scheduled warming: periodically refresh the cache for time-sensitive data; schedule warming during off-peak hours; exploit predictable daily or weekly patterns to pre-warm ahead of demand.
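A minimal sketch of scheduled warming with Spring's @Scheduled (requires @EnableScheduling). The cron expression, the batch size of 100, and the TopProductsQuery source of "hot" IDs are illustrative assumptions.

Java
import java.util.List;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ProductCacheWarmer {

    private final ProductService productService; // the cached service from the earlier sketch
    private final TopProductsQuery topProducts;   // hypothetical source of frequently accessed IDs

    public ProductCacheWarmer(ProductService productService, TopProductsQuery topProducts) {
        this.productService = productService;
        this.topProducts = topProducts;
    }

    // Runs at 05:30 every day (an assumed off-peak window) so the first wave of
    // morning traffic hits a warm cache instead of the database.
    @Scheduled(cron = "0 30 5 * * *")
    public void warmTopProducts() {
        List<Long> hotIds = topProducts.findTopIds(100);
        hotIds.forEach(productService::findById); // @Cacheable populates the cache as a side effect
    }
}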
Intelligent Cache Invalidation Strong cache invalidation balances consistency with performance: Time-to-live (TTL) invalidation: the easiest and most predictable approach; risks serving stale data until the TTL expires; a good fit for data with known update patterns. Event-driven invalidation: invalidates as soon as data changes; more complicated dependencies to manage; always provides fresh data. Tag-based invalidation: tags group related cache entries; closely related data can be invalidated in bulk; dependency management is simplified.

Performance Optimization Techniques Connection Pool Management Connection pool configuration is a key lever for Redis performance: pool size (trade resource consumption against throughput); connection validation (check connection health before use); timeout configuration (detect hanging connections); retry logic (decide which transient failures the application can tolerate). Memory Management Several factors shape Redis memory use: eviction policies (LRU, LFU, TTL); data structures (choose the right Redis data type); compression (compress large values); memory monitoring (watch fragmentation and overall usage). Network Optimization To reduce network overhead: pipelining (batch multiple commands); connection multiplexing (reuse connections); data locality (place the cache close to the application); compression (shrink payloads on the wire).

Monitoring and Observability Key Metrics

| Type | Metric | Purpose |
| --- | --- | --- |
| Performance | Cache hit ratio | Measure cache effectiveness |
| Performance | Avg. response time | Track system speed |
| Performance | Throughput (ops/sec) | Monitor system capacity |
| Performance | Memory utilization | Track resource usage |
| Reliability | Connection pool usage | Track system failures |
| Reliability | Error rates | Track system failures |
| Reliability | Failover events | Ensure high availability |
| Reliability | Network latency | Monitor network performance |
| Business | Cost per cache op | Track operational costs |
| Business | DB load reduction | Measure cache benefits |
| Business | User experience impact | Monitor user-facing performance |

Health Monitoring

| Check type | What to monitor | Action |
| --- | --- | --- |
| Connection | Redis connectivity | Verify active connections |
| Performance | Response times | Detect speed degradation |
| Capacity | Memory & connections | Watch resource thresholds |
| Dependencies | Downstream systems | Monitor external health |

Security Considerations

| Area | Control | Implementation |
| --- | --- | --- |
| Authentication | Redis AUTH | Enable password auth |
| Encryption | TLS | Protect data in transit |
| Network | VPC | Enforce network isolation |
| Access control | IAM roles | Fine-grained permissions |
| Data protection | Encryption at rest | Secure stored cache data |
| Credentials | Key rotation | Rotate secrets regularly |
| Monitoring | Audit logging | Track access & changes |
| Data handling | Sensitive data policy | Avoid caching sensitive info |

Testing Strategies

| Test type | Focus area | What to validate |
| --- | --- | --- |
| Unit | Mock dependencies | Isolate business logic |
| Unit | Cache behavior | Verify cache annotations |
| Unit | Key generation | Ensure consistency |
| Integration | End-to-end flow | Validate workflow with cache |
| Integration | Performance impact | Measure speed improvements |
| Integration | Failure scenarios | Verify graceful fallback |
| Load | Cache under stress | Check scale limits |
| Load | Failover testing | Ensure HA works |
| Load | Capacity planning | Determine optimal sizing |

Best Practices

| Area | Practice | Description |
| --- | --- | --- |
| Cache keys | Consistent naming | Use standard conventions |
| Cache keys | Hierarchical structure | Logical grouping of keys |
| Cache keys | Avoid collisions | Prevent key conflicts |
| Cache keys | Expiration patterns | Use TTL for lifecycle management |
| Error handling | Graceful fallback | Handle cache downtime smoothly |
| Error handling | Proper logging | Record cache failures |
| Error handling | Avoid cascading | Prevent failure propagation |
| Data consistency | Invalidation strategy | Clear/update cache correctly |
| Data consistency | Monitor freshness | Track data age & staleness |
| Data consistency | Handle concurrency | Manage simultaneous updates |

Common Pitfalls

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Cache stampede | Multiple threads reload the same key | Use locking/request coalescing |
| Hot key problem | Uneven load on a few keys | Distribute/shard keys |
| Memory leaks | Unbounded growth | Apply TTL & size limits |
| Stale data | Old values served | Proper invalidation strategy |
| Over-caching | Storing unnecessary data | Cache only hot/frequent items |

Conclusion There are many factors to consider (access patterns, consistency requirements, and performance goals) when implementing caching strategies with Spring Boot and Redis/ElastiCache. A successful implementation depends on choosing the right caching patterns, monitoring them in production, and optimizing based on real usage. Start with simple patterns and introduce more sophisticated strategies only as your requirements evolve, and keep reviewing periodically to ensure your caching strategy remains effective as the application grows. Finally, caching is not a silver bullet; it is effective only when applied well and maintained. Done right, it improves performance and the end-user experience while helping avoid unnecessary infrastructure costs.
It was a nice sunny day in Seattle, and my wife wanted to try the famous viral Dubai Chocolate Pistachio Shake. With excitement, we decided to visit the nearest Shake Shack; to our surprise, it was sold out, and we were told to call before visiting. Because of the limited supply, there is no guarantee it will be available the next day either. Two days later, I went there again to see if there would be any, and again I was met with disappointment. I didn't like that I either had to call them to check for an item or go to the store to see if it was available. This led to a few ideas: What if there was an AI that would make the calls for me, update me when I need a reservation, or wait in that long customer-service line and connect me when someone is ready to talk? Some companies are already working on automating this. Is there a way to be notified when an item is available at Shake Shack? And if there isn't, can I build a cloud infrastructure for this? Continuing with my second idea, I started looking into their website. It is possible to check whether an item is available and add it to the cart for online ordering, which means there are network calls through which we can identify whether the Dubai Chocolate Pistachio Shake is available.

Implementation To get availability information, we need a few data points: Is there a way to get the store information? How do we differentiate whether a store has the shake or not? For the store information, when we open the browser's inspect element, watch the network calls, and select Washington, we see a few interesting requests. Washington state has seven locations, and we need to know which of them has the shake. From the response of the region information call, we can get all the state information. Shell curl 'https://ssma24.com/production-location/regions?isAlpha=true' \ -H 'accept: */*' \ -H 'accept-language: en-US,en;q=0.9' \ -H 'authorization: Basic removedSecretCodeHere==' \ -H 'cache-control: no-cache' \ -H 'origin: https://shakeshack.com' \ -H 'platform-os: macos' \ -H 'platform-version: 1.71.20' \ -H 'pragma: no-cache' \ -H 'priority: u=1, i' \ -H 'referer: https://shakeshack.com/' \ -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \ -H 'sec-ch-ua-mobile: ?0' \ -H 'sec-ch-ua-platform: "macOS"' \ -H 'sec-fetch-dest: empty' \ -H 'sec-fetch-mode: cors' \ -H 'sec-fetch-site: cross-site' \ -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \ -H 'x-requested-with: XMLHttpRequest' According to this, the WA region ID is a3d65b58-ee3c-42af-adb9-9e39e09503c3, and we can see the store information if we pass regionId to the API.
Shell curl 'https://ssma24.com/production-location/locations?regionId=a3d65b58-ee3c-42af-adb9-9e39e09503c3&channel=WEB&includePrivate=false' \ -H 'accept: */*' \ -H 'accept-language: en-US,en;q=0.9' \ -H 'authorization: Basic removedSecretCodeHere' \ -H 'cache-control: no-cache' \ -H 'origin: https://shakeshack.com' \ -H 'platform-os: macos' \ -H 'platform-version: 1.71.20' \ -H 'pragma: no-cache' \ -H 'priority: u=1, i' \ -H 'referer: https://shakeshack.com/' \ -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \ -H 'sec-ch-ua-mobile: ?0' \ -H 'sec-ch-ua-platform: "macOS"' \ -H 'sec-fetch-dest: empty' \ -H 'sec-fetch-mode: cors' \ -H 'sec-fetch-site: cross-site' \ -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \ -H 'x-requested-with: XMLHttpRequest' This information is still not sufficient until we know how to get to a store page and identify the availability. If a store has a shake, it will display the Dubai shake in the shake section. And after investigating the curl calls, I see the call for the menu option. Shell curl 'https://ssma24.com/v1.0/locations/82099/menus?includeOptionalCategories=utensils&platform=web' \ -H 'accept: */*' \ -H 'accept-language: en-US,en;q=0.9' \ -H 'authorization: Basic removedSecretCodeHere==' \ -H 'cache-control: no-cache' \ -H 'channel: WEB' \ -H 'origin: https://shakeshack.com' \ -H 'platform-os: macos' \ -H 'platform-version: 1.71.20' \ -H 'pragma: no-cache' \ -H 'priority: u=1, i' \ -H 'referer: https://shakeshack.com/' \ -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \ -H 'sec-ch-ua-mobile: ?0' \ -H 'sec-ch-ua-platform: "macOS"' \ -H 'sec-fetch-dest: empty' \ -H 'sec-fetch-mode: cors' \ -H 'sec-fetch-site: cross-site' \ -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \ -H 'x-requested-with: XMLHttpRequest' If a shake is available, then we see it in the product section. If a store does not have it, then we will not see it in the response. So, all we need to do is get all the store information and verify which stores have the shake. If you look into the store queries, its oloId is linked in the queries related to a store. This can be mapped to the store information from the previous queries, using which I was able to get all the store IDs. With some basic shell scripting, I was able to create this curl script, which will tell which store will have the shake. 
Shell for store in 203514 236001 265657 82099 62274 96570 203515; do echo $store curl "https://ssma24.com/v1.0/locations/$store/menus?includeOptionalCategories=utensils&platform=web" \ -H 'accept: */*' \ -H 'accept-language: en-US,en;q=0.9' \ -H 'authorization: Basic removedSecretCodeHere==' \ -H 'cache-control: no-cache' \ -H 'channel: WEB' \ -H 'origin: https://shakeshack.com' \ -H 'platform-os: macos' \ -H 'platform-version: 1.71.20' \ -H 'pragma: no-cache' \ -H 'priority: u=1, i' \ -H 'referer: https://shakeshack.com/' \ -H 'sec-ch-ua: "Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"' \ -H 'sec-ch-ua-mobile: ?0' \ -H 'sec-ch-ua-platform: "macOS"' \ -H 'sec-fetch-dest: empty' \ -H 'sec-fetch-mode: cors' \ -H 'sec-fetch-site: cross-site' \ -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36' \ -H 'x-requested-with: XMLHttpRequest' | jq | grep '"name": "Dubai Chocolate Pistachio Shake",' done Here, each value in the loop is a storeId. I used jq to format the response as readable JSON and grepped for the product name; stores that carry the shake return a hit. Based on the responses, we know that stores 82099, 62274, and 96570 have the shakes. I got the store addresses from the earlier calls, and we got our shakes. :)

Conclusion In this article, we went through the network calls to identify the information we need and figured out a way to get the store information from those calls. We could also automate this with Selenium (which is compute-heavy), or use an AI to analyze the website and set up this infrastructure for us; that kind of advancement will eventually displace certain QA-related jobs (we are not there yet, but we will be very soon). Most of the approaches we saw today were manual. In my next articles, I will show how we can take this simple idea and build a cloud infrastructure around it. I will be writing about building: a cloud-based serverless call to fetch the info from the web; a cloud-based trigger service that invokes the serverless call; a storage system to cache this data (where can we store the responses?); and a notification system that alerts you through a mobile app or other channels. If we collect enough data points, I will also show how to build an ML model on sparse data. This is for educational purposes: my partner (SGG) and I took this real-world example to build a cloud infrastructure, and the code and other artifacts will not be published, to ensure Shake Shack's websites are not overloaded.
For years, teams have bundled continuous integration (CI) and continuous delivery (CD) into a single concept: CI/CD. This shorthand suggests a seamless pipeline, but in practice, it creates confusion and hides the fact that CI and CD solve very different problems. CI is like the quality control process in a factory, meticulously inspecting and testing every component to ensure it's safe and meets standards before it's ever installed. CD, on the other hand, is the logistics company, using a deliberate strategy to deliver the finished product to the customer, monitoring its journey, and having a plan for a safe return if something goes wrong. Treating them as one often creates unoptimized workflows, blurs the separation of responsibilities, and causes confusion about what is needed when. To further the argument, here are a few concrete reasons why you should be thinking of them as separate things. Different Goals Continuous Integration (CI): Quality and Confidence The core question: “Is my change safe enough to merge?”Focus areas: CI is about validating that code meets quality, security, and compliance standards before it enters the main branch.Practices: automated builds, unit and integration tests, static analysis, vulnerability scanning, dependency checks, artifact signing, and SBOM generation.Outcome: A trusted artifact that can be confidently promoted downstream.Example: If a developer submits a pull request, the CI ensures, within minutes, that the change compiles, passes tests, doesn’t break dependencies, and is safe to integrate. Continuous Delivery (CD): Safe and Progressive Rollout The core question: “Can we reliably and securely deliver this artifact to its intended consumers—whether people, services, or APIs?”Focus areas: CD is about risk management, governance, and customer impact once the artifact is built.Practices: environment promotion (dev → staging → prod), canary or blue/green rollouts, feature flag management, compliance gates, audit logging, and rollback procedures.Outcome: A reliable and observable release process that delivers changes with minimal disruption.Example: A new feature is rolled out gradually to 1% of users, monitored for errors, then expanded to 100% once stability is confirmed. Different Timelines CI: Fast, Continuous, and High-Frequency Every commit and pull request triggers CI.The emphasis is on speed — feedback loops must be short (minutes, not hours) so developers can act immediately.Failures should block merges, preventing broken code from entering the mainline.Analogy: Like a factory inspection that happens on every single part, every time.Example: A developer pushing 10 commits in a day expects CI to validate each one quickly. CD: Deliberate, Risk-Aware, and Staged Rollouts happen at a slower cadence — daily, weekly, or on-demand, depending on risk tolerance.Requires coordination across environments, approvals, and sometimes regulatory compliance.Monitoring windows may extend hours or days after a release before full rollout.Analogy: Like a logistics company moving goods across borders — timing matters, permits matter, and a recall plan must exist.Example: A financial services company might run CI on every change but only release production updates weekly due to compliance requirements. 
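To make the cadence contrast concrete, here is an illustrative sketch (not from the article) of two separate GitHub Actions workflow files: CI fires on every pull request for fast feedback, while CD is a deliberately triggered workflow gated by an environment approval. The job steps, environment name, and make targets are placeholders.

YAML
# .github/workflows/ci.yml - runs on every pull request; failures block the merge
name: ci
on:
  pull_request:
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build   # placeholder build step
      - run: make test    # placeholder test step

# .github/workflows/cd.yml - a separate workflow, run on demand at whatever cadence risk allows
name: cd
on:
  workflow_dispatch:      # deliberate, human-initiated rollout
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # environment protection rules supply the approval gate
    steps:
      - uses: actions/checkout@v4
      - run: make deploy  # placeholder; canary/blue-green logic and monitoring hooks live here

Keeping the two in separate workflows is exactly the decoupling the following sections argue for: CI failures never block an unrelated rollout, and rollout policy changes never touch developer test workflows.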
Different Teams CI Ownership: Developers and Platform Teams The goal is to empower developers to move quickly without breaking things.Responsibilities include maintaining build systems, test frameworks, and integration signals.Platform teams provide standardized CI infrastructure, caching, remote execution, and reusable build rules so every team isn’t reinventing the wheel.Example: A platform engineering team runs the CI cluster, but developers own test coverage and fixing broken builds. CD Ownership: Release Engineering, SRE, Operations, and Product The goal is to protect customers and business outcomes during releases.Responsibilities include defining rollout strategies, monitoring production, enforcing governance, and triggering rollbacks if needed.Product owners may set release cadence; operations teams ensure compliance and reliability.Example: SREs own monitoring dashboards and rollback automation, while release managers coordinate staged rollouts across geographies. Different Abstractions CI Abstractions: Build, Test, and Validation Build pipelines – automated workflows that compile and test code, ensuring every commit produces consistent resultsDependency management – standardized handling of libraries and packages to avoid version drift and vulnerabilitiesRemote execution and caching – speeds up builds/tests by distributing work and reusing outputsArtifact signing and storage – ensures artifacts are tamper-proof, traceable, and available for downstream deliverySBOM and compliance checks – generate software bills of materials and validate compliance early in the cycle CD Abstractions: Release, Risk, and Control Environment promotion – moves artifacts through dev, staging, and production with clear validation gates.Rollout policies – blue/green, canary, or progressive strategies that reduce blast radius and manage risk.Feature flags – decouple deployment from release, enabling safe toggles, experiments, and instant rollbacks.Observability and monitoring – built-in health checks and alerts that validate success during rollout.Rollback and governance – automated rollback on failures, with compliance and audit controls for accountability The Danger of One “Pipeline” Forcing CI and CD into a single YAML or workflow couples these abstractions in ways that make systems rigid and fragile. Build failures block releases unnecessarily, rollout strategies become entangled with developer tests, and ownership gets muddied. This reduces agility and increases operational risk. Conclusion CI ensures every change is safe to merge.CD ensures every change is safe to release. CI and CD are both critical, but treating them as one blurs their purpose and diminishes their impact. To build developer platforms that are secure, scalable, and truly empowering, we must stop hiding behind the shorthand of “CI/CD” and instead invest in CI and CD as distinct, composable disciplines. With a clear separation, organizations gain the best of both worlds: developers iterate quickly with confidence, and operations deliver safely with control. Collapsing them into a single concept may sound convenient, but in reality, it creates confusion, slows progress, and makes systems more fragile.
The phrase “Garbage in, Garbage out” is not a new one, and nowhere is this phrase more applicable than in machine learning. The most sophisticated and complex model architecture will crumble under the weight of poor data quality. Conversely, high-quality and reliable data can power even simple models to drive significant business impact. In this post, we will deep dive into why data quality is critical, what dimensions matter most, the problems poor data creates, and how organizations can actively monitor and improve data quality. We will also examine a practical example of credit score and close with the case for treating data quality as a first-class citizen in ML workflows. Why Data Quality Matters in ML Machine learning models approximate the world through the patterns present in the training data. If the data is inaccurate, incomplete, or of poor quality, the model learns a distorted picture of the world. Models trained in this manner are fragile, prone to cold start problems, and can overfit on noise. Stakeholders quickly lose trust in such models as predictions often don’t align with common sense or business intuition. The consequences of poor model predictions are not just technical but also have a large impact on customer experiences, wasted resources, and, in some cases, reputational risk. In simple terms, the machine learning model is the last mile of the ML workflow. The real foundation is the data, and without good quality data, the foundation is too weak to support building any systems on top of it. Key Dimensions of Data Quality Data quality is a broad and abstract concept, but it becomes more measurable when we break it down into different dimensions. Accuracy is the most important and obvious one: If the input data is wrong (e.g., mislabeled transactions in fraud detection models), the model will simply learn incorrect patterns. Completeness is equally important. Without a high degree of coverage for important features, the model will lack context and produce weaker predictions. For example, a recommender system missing key user attributes will fail to provide personalized recommendations. Freshness plays a subtle but powerful role in data quality. Outdated data appears correct, but does not reflect real-world conditions. Predicting user churn based on 6-month-old engagement logs is meaningless in fast-moving consumer businesses. Finally, uniqueness ensures that duplicate records do not bias models by overrepresenting certain patterns or segments. These dimensions, despite being distinct, rarely fail in isolation. In reality, they interact, amplifying the impact. Problems Bad Data Causes in ML Training a model on poor-quality data leads to two major issues. The first one is a poor generalization. A model trained on flawed data learns patterns that do not extend to the real world. A simple example is a credit scoring model trained with an underrepresentation of applicants in the 18–25 age group. It may perform well in testing (since the same flawed dataset is used), but it can fail in the real world since it never learned the patterns for this age group. The second issue is overfitting to noisy signals. The presence of errors and mislabeled data points leads to the model memorizing these quirks instead of learning the correct patterns. For example, instead of learning which users are likely to repay a loan, the model may memorize signals tied to data entry glitches. 
Despite performing well on training data, they can crumble on real-world examples, as what they memorized does not exist in reality. Example: Credit Scoring Models Credit scoring models provide a practical illustration of the consequences of bad data. Imagine a training dataset where repayment histories are logged incorrectly and updates from smaller banks never make it into the logs. A model built on this flawed foundation is bound to make misjudgments. Reliable borrowers can be flagged as risky, while high-risk individuals may slip through undetected. The impact has two facets: First, lenders lose money due to misallocating credit. Secondly, customers lose trust in the system due to misclassifications and unfair results. In real-world examples, such failures have drawn regulatory scrutiny and negative press (remember the Apple Card debacle), which damages both the business and its reputation. This is not a problem of algorithms but of data quality, a lesson many organizations learned the hard way. Detecting and Improving Data Quality Detecting data quality issues is not just about a single check but rather about continuous monitoring. Statistical distribution checks are the first line of defense, helping detect anomalies or sudden shifts that can indicate broken data pipelines. For example, a 50% drop in the average income of applicants for a credit organization is likely to be a data ingestion issue. Monitoring in production is an essential part of the whole workflow. Even if the training dataset is well vetted, the real world evolves. Features drift, upstream UI changes lead to funnel distribution shifts, and pipelines can break silently. By implementing continuous monitoring, these issues can be detected early before they degrade model performance. On top of detection, prevention is also equally important. Strong processes like label validations and human-in-the-loop validations ensure that ambiguous cases are identified and flagged for expert review. This ensures continuous improvement of the validation process, as well as bad data not reaching the training data stage. It is important to enforce these standards at the data ingestion stage. Data should not flow without proper checks into the feature store. Schema validations, distribution checks, data uniqueness, etc., should be implemented such that they act as firewalls to prevent flawed data from contaminating the pipeline. This approach transforms data quality into a first-class citizen in everyday workflows. Business Implications Ignoring data quality can often turn out to be very expensive. Teams spend large amounts of compute to retrain models on flawed data, to observe little to no business impact. Launch timelines get pushed back since teams spend weeks debugging data issues, a time that could have been spent otherwise on feature development. In industries that are regulated, like finance and healthcare, poor data quality can cause compliance violations and increased legal expenses. The flip side is that a solid investment in data quality can offer one of the highest returns on investment in ML. Cleaner data will almost always improve performance, making the model stronger. It will help build trust among stakeholders when predictions align with reality. It will also reduce operational costs by reducing the amount of time the team spends fighting issues. Many seasoned ML teams have learned that the fastest path to better models is fixing the data rather than onboarding complex architectures. 
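To make the "checks at ingestion" idea from the detection section concrete, here is a minimal Python sketch of a pre-ingestion gate. The pandas dependency, column names, and thresholds are illustrative assumptions; real pipelines would typically use a dedicated validation framework wired into the feature store write path.

Python
import pandas as pd

# Illustrative expectations for an incoming applicants batch; tune to your own data.
EXPECTED_COLUMNS = {"applicant_id": "int64", "age": "int64", "income": "float64"}
MAX_NULL_RATE = 0.02   # at most 2% missing values per column
MAX_MEAN_SHIFT = 0.5   # flag a numeric column whose mean drifts more than 50% vs. the reference

def validate_batch(batch: pd.DataFrame, reference: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations; an empty list means the batch may proceed."""
    problems = []

    # Schema check: every expected column present with the expected dtype.
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in batch.columns:
            problems.append(f"missing column: {col}")
        elif str(batch[col].dtype) != dtype:
            problems.append(f"wrong dtype for {col}: {batch[col].dtype}, expected {dtype}")

    # Completeness check: null rate per shared column.
    for col in batch.columns.intersection(reference.columns):
        null_rate = batch[col].isna().mean()
        if null_rate > MAX_NULL_RATE:
            problems.append(f"high null rate in {col}: {null_rate:.1%}")

    # Simple distribution check: large shift in the mean of numeric columns.
    for col in batch.select_dtypes("number").columns.intersection(reference.columns):
        ref_mean = reference[col].mean()
        if ref_mean and abs(batch[col].mean() - ref_mean) / abs(ref_mean) > MAX_MEAN_SHIFT:
            problems.append(f"distribution shift in {col}")

    # Uniqueness check: duplicate records overrepresent certain patterns.
    if "applicant_id" in batch.columns and batch["applicant_id"].duplicated().any():
        problems.append("duplicate applicant_id values")

    return problems

A batch that returns any violations is quarantined and flagged for review instead of flowing into the feature store, which is the "firewall" behavior described above.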
Takeaway Machine learning models rarely fail because of a lack of sophisticated model architectures. They fail because of poor data quality. High-quality and reliable data is the true foundation of sustainable ML systems. A basic model with excellent data often outperforms an advanced model with flawed data. Data quality is not a one-time project but a continuous practice that requires rigorous monitoring, process discipline, and organizational investment. Teams must start treating data quality as a first-class citizen, similar to production code, because data is the input for ML models.
The demand for real-time applications has exploded, from collaborative documents and live data dashboards to multiplayer games and instant messaging. WebSockets, with their persistent, bi-directional communication protocol, have become the de facto standard for building these experiences. However, the traditional approach — running a dedicated server to manage thousands of long-lived connections — introduces significant complexities in scalability, cost, and operational overhead. This paradigm is being fundamentally challenged by the rise of serverless computing. But can the stateless, ephemeral nature of typical serverless functions truly support a stateful, persistent protocol like WebSockets? This article explores a unique and powerful answer to that question, demonstrating how to build a fully serverless WebSocket service using Cloudflare Workers, Durable Objects, and the lightweight Hono framework. By decoupling the routing from the stateful logic and leveraging the global Cloudflare network, we can create a solution that is not only highly scalable and cost-effective but also globally distributed by default. The State-Full Dilemma in a Serverless World Most serverless platforms are built on a "function-as-a-service" model where compute instances are stateless and short-lived. A function spins up to handle a single request and then shuts down, discarding all in-memory state. This design is perfect for REST APIs or event-driven tasks, but fundamentally clashes with the requirements of a WebSocket, which demands a continuous, open connection. Attempting to run a WebSocket server on such a platform typically requires creative, often complex workarounds involving external, stateful databases or pub/sub services. These solutions reintroduce the very complexity and latency that serverless was meant to solve. Cloudflare's platform, however, provides a game-changing primitive that bridges this gap: Durable Objects. Cloudflare's Durable Objects: State at the Edge A Durable Object is a single-instance class that provides a dedicated, stateful environment. Unlike a regular Cloudflare Worker, which is a stateless function, a Durable Object has a persistent identity. When a request is made to a specific Durable Object ID, Cloudflare guarantees that it will always be routed to the same single instance. This instance maintains its in-memory state for as long as it's needed, making it a perfect fit for managing WebSocket connections. Here’s the architecture we’ll build: A Cloudflare Worker acts as the public-facing entry point, serving as a smart router.When a WebSocket connection request arrives, the Worker looks up a specific Durable Object instance (e.g., a "chat room" or a "game session").The Worker then forwards the WebSocket upgrade request to that specific Durable Object, which takes over the connection. This separation of concerns allows the Worker to handle millions of concurrent routing requests efficiently, while the Durable Object focuses solely on managing the state and real-time communication for its unique session. A key benefit is the WebSocket Hibernation API, which allows the Durable Object to become inactive and save on compute costs when no messages are being sent, while still keeping the underlying connection alive. Why Hono? The Framework for the Edge To streamline development, we'll use Hono, a lightweight, fast web framework designed for the edge. 
Its API is familiar to developers coming from frameworks like Express.js, and it has first-class support for Cloudflare Workers and WebSockets. Hono’s small footprint and performance make it an ideal choice for a serverless environment where every millisecond counts. Let’s dive into the code and build a simple, scalable chat room. Step 1: Project Setup and Configuration First, initialize your project using the Cloudflare CLI. This will scaffold a basic Hono worker. Shell npm create cloudflare@latest serverless-chat-app --framework=hono Next, configure your wrangler.toml file. This is crucial as it informs Cloudflare about our Durable Object binding. TOML name = "serverless-chat-app" main = "src/index.ts" compatibility_date = "2024-05-18" workers_dev = true [[durable_objects.bindings]] name = "CHAT_ROOM" class_name = "ChatRoom" [[migrations]] tag = "v1" new_classes = ["ChatRoom"] This configuration creates a binding named CHAT_ROOM that maps to a class we'll define called ChatRoom. The migration tag is essential for Cloudflare to correctly manage the Durable Object lifecycle. Step 2: The Stateful Durable Object This is the core of our application. The ChatRoom class will hold all our connected WebSocket sessions in memory and manage the message broadcasting. TypeScript // src/chat-room.ts // Define the structure of a chat message for type safety. interface ChatMessage { user: string; message: string; timestamp: number; } export class ChatRoom { private state: DurableObjectState; private sessions: Set<WebSocket>; constructor(state: DurableObjectState) { this.state = state; this.sessions = new Set(); } // Handles all incoming requests. We only care about WebSocket upgrades here. async fetch(request: Request): Promise<Response> { const upgradeHeader = request.headers.get("Upgrade"); if (upgradeHeader !== "websocket") { return new Response("Expected a WebSocket upgrade request.", { status: 426 }); } const { 0: client, 1: server } = new WebSocketPair(); // Accept the connection and add the server-side WebSocket to our session list. this.state.acceptWebSocket(server); this.sessions.add(server); console.log(`New WebSocket connection established. Total connections: ${this.sessions.size}`); // Return the client-side WebSocket back to the client. return new Response(null, { status: 101, webSocket: client }); } // Handles messages received from any connected client. async webSocketMessage(ws: WebSocket, message: string | ArrayBuffer): Promise<void> { if (typeof message === "string") { try { const parsedMessage: ChatMessage = JSON.parse(message); console.log(`Received message from ${parsedMessage.user}: ${parsedMessage.message}`); const response: ChatMessage = { user: parsedMessage.user, message: parsedMessage.message, timestamp: Date.now(), }; const serializedResponse = JSON.stringify(response); // Broadcast the message to all other connected clients in the room. this.sessions.forEach((session) => { if (session !== ws) { session.send(serializedResponse); } }); } catch (e) { console.error("Failed to parse message:", e); } } } // Cleans up a session when a WebSocket connection is closed. async webSocketClose(ws: WebSocket): Promise<void> { console.log("WebSocket connection closed."); this.sessions.delete(ws); console.log(`Total connections remaining: ${this.sessions.size}`); } // Handles errors on a WebSocket connection. 
async webSocketError(ws: WebSocket, error: any): Promise<void> { console.error("WebSocket error:", error); this.sessions.delete(ws); } } This class is the key to our serverless solution. It is stateful because the sessions set persists in memory for the lifetime of the Durable Object, allowing us to manage and broadcast messages to all connected clients effortlessly. Step 3: The Hono Worker as a Router The main index.ts file uses Hono to handle the incoming HTTP request. Its sole purpose is to get a handle on the correct Durable Object instance and pass the request off to it. TypeScript // src/index.ts import { Hono } from "hono"; import { ChatRoom } from "./chat-room"; // Define the environment bindings, including our Durable Object. type Bindings = { CHAT_ROOM: DurableObjectNamespace; }; const app = new Hono<{ Bindings: Bindings }>(); // Define the WebSocket endpoint. app.get("/ws", (c) => { // Use a hardcoded name to create a consistent Durable Object ID. // This ensures all clients connect to the same 'global-chat-room' instance. const objectId = c.env.CHAT_ROOM.idFromName("global-chat-room"); const durableObjectStub = c.env.CHAT_ROOM.get(objectId); // Forward the original request to the Durable Object stub. return durableObjectStub.fetch(c.req.url, c.req); }); // Export the ChatRoom class so Cloudflare can find it. export { ChatRoom }; // Export the Hono app as the default handler for the Worker. export default app; The idFromName method is fundamental here. It generates a consistent Durable Object ID based on a string name. Any client connecting to /ws will always be routed to the same ChatRoom instance, creating a single, shared chat room for all users. Step 4: The Client-Side Connection The client-side code is a standard WebSocket implementation. It connects to the URL of our Cloudflare Worker and handles incoming and outgoing JSON messages. HTML <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Serverless Chat</title> </head> <body> <h1>Cloudflare Serverless Chat</h1> <input type="text" id="username" placeholder="Enter your name"> <ul id="messages"></ul> <form id="message-form"> <input type="text" id="message-input" placeholder="Type a message..." autocomplete="off"> <button type="submit">Send</button> </form> <script> const usernameInput = document.getElementById("username"); const messagesList = document.getElementById("messages"); const messageForm = document.getElementById("message-form"); const messageInput = document.getElementById("message-input"); let ws; // Establish the WebSocket connection. function connect() { // Replace with your Cloudflare Worker URL const workerUrl = "wss://<your-worker-subdomain>.workers.dev/ws"; ws = new WebSocket(workerUrl); ws.onopen = (event) => { console.log("Connected to WebSocket server."); }; ws.onmessage = (event) => { const data = JSON.parse(event.data); const li = document.createElement("li"); li.textContent = `${data.user}: ${data.message}`; messagesList.appendChild(li); messagesList.scrollTop = messagesList.scrollHeight; }; ws.onclose = (event) => { console.log("Disconnected. Attempting to reconnect in 5 seconds..."); setTimeout(connect, 5000); }; ws.onerror = (error) => { console.error("WebSocket error:", error); ws.close(); }; } // Handle sending a message. 
messageForm.addEventListener("submit", (e) => { e.preventDefault(); const message = messageInput.value; const user = usernameInput.value || "Anonymous"; if (message.trim() !== "" && ws && ws.readyState === WebSocket.OPEN) { const chatMessage = { user, message }; ws.send(JSON.stringify(chatMessage)); messageInput.value = ""; } }); connect(); </script> </body> </html> Conclusion This architecture completely removes the need to manage a traditional WebSocket server. The complexity of scalability and state management is handled by the Cloudflare platform, while the developer is free to focus on the application logic. The result is a real-time service that is not only highly scalable and cost-effective but also globally distributed, thanks to Cloudflare's edge network. The combination of Cloudflare Workers, stateful Durable Objects, and the Hono framework provides a robust and elegant solution for building the next generation of real-time applications at the edge.
As data engineers, we’ve all encountered those recurring requests from business stakeholders: “Can you summarize all this text into something executives can read quickly?”, “Can we translate customer reviews into English so everyone can analyze them?”, or “Can we measure customer sentiment at scale without building a new pipeline?”. Traditionally, delivering these capabilities required a lot of heavy lifting. You’d have to export raw data from the warehouse into a Python notebook, clean and preprocess it, connect to an external NLP API or host your own machine learning model, handle retries, manage costs, and then write another job to push the results back into a Delta table. The process was brittle, required multiple moving parts, and — most importantly — took the analysis out of the governed environment, creating compliance and reproducibility risks. With the introduction of AI functions in Databricks SQL, that complexity is abstracted away. Summarization, translation, sentiment detection, document parsing, masking, and even semantic search can now be expressed in one-line SQL functions, running directly against governed data. There’s no need for additional infrastructure, no external services to maintain, and no custom ML deployments to babysit. Just SQL, governed and scalable, inside the Lakehouse. In this article, I will walk you through five such functions using the familiar Bakehouse sample dataset. We will see how tasks that once demanded custom pipelines and weeks of engineering effort are now reduced to simple queries, transforming AI from a specialized project into an everyday tool for data engineers. 1. Summarization With ai_summarize() If you wanted to summarize Bakehouse customer reviews in the past, the workflow was anything but simple. Reviews are often long, unstructured, and written in free-form text — which means they contain everything from slang and typos to emojis, mixed languages, and incomplete sentences. Extracting the raw reviews from Delta tables was only the beginning. The real challenge was making that text usable for downstream analysis. First, you had to clean and normalize the data: removing non-standard characters, fixing casing inconsistencies, stripping out emojis or special symbols, and sometimes even detecting and filtering different languages. Only after preprocessing could you feed the cleaned text into a Python-based summarization model (like Pegasus, BART, or T5). Running those models at scale introduced its own operational overhead: managing GPUs, batching requests, handling long input sequences, and storing the generated summaries back into Delta tables. Finally, you had to write additional logic to extract useful signals — often reducing a verbose, messy review into a short two-sentence takeaway. The entire pipeline was brittle, resource-intensive, and required constant maintenance. With the new ai_summarize() function in Databricks SQL, this entire process collapses into a single line of code. You simply pass the raw review text into the function, and it returns a concise summary directly as part of your query results. No separate preprocessing, no external APIs, no ML pipeline maintenance — just SQL. The function is smart enough to handle free-form text, cut through the noise, and surface the main point of a customer’s feedback. 
Before we look at the summaries themselves, let’s first explore the raw complexity of the review column in the Bakehouse dataset with a simple query:

SQL
SELECT franchiseID, review_date, review
FROM samples.bakehouse.media_customer_reviews
LIMIT 25;

Now, let's use the ai_summarize function to summarize the reviews:

SQL
SELECT ai_summarize(CONCAT('Franchise: ', franchiseID, ', Review: ', review)) AS franchise_review
FROM samples.bakehouse.media_customer_reviews
LIMIT 25;

2. Translation With ai_translate()

Consider this scenario: the Bakehouse management team in Japan wants to analyze customer reviews, but most of the feedback is stored in English. For the Japan team, reading reviews in English creates a barrier; not only does it slow down analysis, but it also introduces the risk of misinterpretation or missed cultural nuances. As data engineers, we’ve all dealt with these kinds of requests: “Can you make this dataset available in our local language?” Traditionally, this meant exporting the reviews out of Delta tables, wiring them into a third-party translation API, managing authentication and quotas, handling errors and retries, and then loading the translated text back into the warehouse. This was a multi-step process that required maintaining fragile ETL pipelines and often raised compliance questions since sensitive customer data had to leave the governed environment. With ai_translate(), this entire workflow collapses into a single SQL query. The function takes raw review text as input and returns the same content in the target language, Japanese in this case. For the Bakehouse dataset, this means the Japan team can instantly access reviews in their local language without any additional infrastructure.

SQL
SELECT franchiseID, review_date, review, ai_translate(review, 'ja') AS review_japanese
FROM samples.bakehouse.media_customer_reviews
WHERE review LIKE '%Tokyo%'
LIMIT 10;

3. Sentiment Analysis With ai_analyze_sentiment()

Traditionally, data engineers or data scientists had to build or fine-tune a sentiment analysis model in Python, often starting with frameworks like TensorFlow, PyTorch, or Hugging Face. The process involved collecting labeled data, training or fine-tuning a classifier, validating the model, and then packaging it into a deployable service. Once deployed, the service had to be hosted on a GPU or CPU endpoint, monitored for uptime, and maintained with scaling logic for production loads. On top of that, engineers had to write pipeline jobs to send raw review text to the endpoint, collect the predictions, and store results back into Delta tables. All this work just to answer a seemingly simple question: “Are our customers happy or not?” With Databricks’ ai_analyze_sentiment() function, that entire workflow is reduced to a single line of SQL. There’s no need to train models, deploy endpoints, or manage infrastructure. You can feed raw review text directly into the function, and it automatically returns a sentiment label such as positive, negative, or neutral.

SQL
SELECT review, ai_analyze_sentiment(review) AS sentiment
FROM samples.bakehouse.media_customer_reviews
LIMIT 25;

4. Mask PII Data With ai_mask()

Protecting personally identifiable information (PII) is one of the most challenging tasks for data engineers. I have previously written a detailed article on DZone about building scalable data pipelines with security in mind. The ai_mask() function automatically detects and masks PII based on the entity types you pass in as input parameters.
The Bakehouse analytics team can safely analyze reviews without exposing sensitive customer data, all directly in SQL, with no custom regex required.

SQL
SELECT franchiseID, review_date, ai_mask(review, ARRAY('PERSON', 'EMAIL', 'PHONE_NUMBER')) AS masked_review
FROM samples.bakehouse.media_customer_reviews
LIMIT 10;

Conclusion

The examples we explored, ai_summarize, ai_translate, ai_analyze_sentiment, and ai_mask, show how these AI functions simplify workloads for engineering teams and allow them to run quick analytics. Other functions, such as ai_parse_document for extracting structure from documents, follow the same pattern. Tasks that once required complex pipelines, custom Python scripts, or external APIs are now reduced to simple, one-line SQL statements. These AI SQL functions are in Public Preview at the time of writing this article and may evolve as Databricks expands their capabilities.
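As a final illustration, here is a minimal sketch of how the functions covered above compose in a single governed query over the same Bakehouse sample dataset: it enriches each review with a masked copy, a sentiment label, and a summary, and then runs a quick aggregate on the result. The target table name, bakehouse_review_insights, is hypothetical, and the sketch assumes you have a writable default catalog and schema; the AI function calls themselves are the same ones used in the sections above.

SQL
-- Enrich the raw reviews in a single pass; the table name below is illustrative.
CREATE OR REPLACE TABLE bakehouse_review_insights AS
SELECT
  franchiseID,
  review_date,
  ai_mask(review, ARRAY('PERSON', 'EMAIL', 'PHONE_NUMBER')) AS masked_review,
  ai_analyze_sentiment(review) AS sentiment,
  ai_summarize(review) AS summary
FROM samples.bakehouse.media_customer_reviews;

-- Quick downstream analytics: sentiment mix per franchise.
SELECT franchiseID, sentiment, COUNT(*) AS review_count
FROM bakehouse_review_insights
GROUP BY franchiseID, sentiment
ORDER BY franchiseID, review_count DESC;

Because everything runs inside the warehouse, the enriched table stays under the same governance as the source data, with no exports, external APIs, or model endpoints involved.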
Preamble

I recently had a conversation with my friend about starting a new company. We discussed the various stages a company should go through to become mature and secure enough to operate in the modern market. This article will outline those stages. The suggested approach is based on the following principles:

Security by default
Security by design
Identification, authentication, and authorization
Segregation of responsibilities

You can follow this flow assuming that you're starting a product from scratch without any existing VNETs, IDPs, or parent companies' networks. However, if you have any of these things, you must adjust the flow accordingly. Here are some definitions of terms that we'll be using in this article:

SIEM (Security Information and Event Management) – an approach and tool used to monitor anomalies in networks and applications.
SOAR (Security Orchestration, Automation, and Response) – a tool that consumes the events produced by SIEM and applies corresponding automated responses.

Amble

You have an idea and a couple of developers eager to build it, and at the very least you would like to feel safe from a security perspective. You have already imagined the financial model and have a product vision; you are almost ready to invest time and money. The first step is to choose the identity provider (IDP). Why? Sooner or later, your development team will grow, and managing identities across multiple non-integrated services will become a headache. You can choose:

Public IDPs, such as Microsoft Entra ID (formerly Azure Active Directory) and AWS IAM.
Hosted IDPs, such as Microsoft Active Directory, Simple AD, open-source LDAP services, etc.

The choice will significantly affect the tools and the order of tasks that need to be done. For the generic case, assume a public IDP is used. In most cases, the IDP tool dictates a method for implementing access control policies; however, if it does not, you need to choose among RBAC, GBAC, ABAC, or other options. The next step is to create a plan for environments, network planning, and a network map. Why is it important? Network segregation matters not only to the operations team, which will maintain the applications, and the DevOps team, which will handle deployments, but also for security reasons: you will need to restrict network access and implement SIEM/SOAR systems, and without sound network planning, these systems become ineffective. We will start building a closed perimeter for our product during this step. Only authenticated and authorized users should have access to it. Therefore, it’s time to select public, private, or closed subnets per environment, specify ranges for tunnels (if applicable), and define a VPN subnet. We also need to deploy a VPN server and configure it to use our IDP as a source of truth. Only at this stage are we ready to start developing the MVP.

POC/MVP/Demo Stage

The application is deployed into subnets based on its logical structure. However, the build/deploy engine cannot reach orchestration or database endpoints from external networks. There are three ways to overcome this issue:

Deploy the build engine with build agents inside the networks
Deploy the build agent and configure management with the pull model
Use the GitOps approach

Now that there is a codebase, it's time to conduct SAST (Static Application Security Testing) and DAST (Dynamic Application Security Testing).
SAST tools, such as SonarQube, Snyk, and Fortify SCA, as well as DAST tools like Veracode, Acunetix, and Burp Suite, can be used for this purpose. Some of these tools are cross-functional and can play both roles; the difference lies only in the stage at which they are applied. Over several sprints, the product will be developed to a level of quality that can be delivered to demo users or shown to investors.

Live

Your product is now ready to go live. We have a staging environment whose network and application mirror production. This is the time to configure firewall rules, NACLs, or any other method of restricting access to anyone who is not part of the QA, security, operations, or another assigned team. We need this because our application was previously located within a closed perimeter with no external load balancers, CDNs, or WAFs. These must now be rolled out, configured, checked, and tested consistently. Once we have identified potential live users, we will deploy a SIEM system to track malicious activity within the subnets. This will help us prevent cyberattacks at an early stage. One significant difference between enterprises and start-ups is the implementation of SSO. Although it can be costly and pose integration challenges, it standardizes sign-in approaches and protects the authentication endpoints.

Post-Live Security

I will skip security hardening and legal requirements because they are specific to each industry and country, and proceed to post-production ideas. Our product is now observed and defended, and it’s time for proactive measures:

Deploy a SOAR system to offload the security team
Establish end-to-end encryption
Train AI models on the security issues reported by SIEM/SOAR
Establish a security audit process, including regular penetration testing, red team and blue team exercises, etc.

If you want to combat cybercrime consistently, it’s also a good idea to set up a honeypot and report abusive activity to public abuse databases.

Epilog

As you can see, building a secure startup is not as complex as it appears, and it is much easier to do so at the early stages, avoiding financial and reputational losses later.
TL;DR: AI Transformation Failures

Organizations seem to fail their AI transformation using the same patterns that killed their Agile transformations: performing demos instead of solving problems, buying tools before identifying needs, celebrating pilots that can’t scale, and measuring activity instead of outcomes. These aren’t technology failures; they are organizational patterns of performing change instead of actually changing. Your advantage isn’t AI expertise; it’s pattern recognition from surviving Agile. Use it to spot theater, demand real problems before tools, insist on integration from day one, and measure actual value delivered.

The Same Movie With New Costumes

Organizations are failing their AI transformation in the same way they failed their Agile transformations. Not in similar ways. In the same way. The question is: why do we keep watching the same movie?

The Pattern That Keeps Repeating

Start with what we know happens. An organization announces its AI transformation. Leadership showcases demos of AI capabilities; impressive ones that generate real enthusiasm. Tools are evaluated and purchased. Metrics dashboards appear to track adoption rates. Pilot teams report spectacular results. Then production reality hits. The demos don’t integrate with existing systems. The tools solve problems nobody actually has. The pilots can’t scale beyond their controlled environments. The metrics show 87% adoption while business outcomes remain unchanged. This sequence isn’t random. It’s structural. When organizations need to appear innovative, they optimize for what’s visible rather than what’s valuable. Demos are visible. Integration work isn’t. Tool purchases are visible. Problem analysis isn’t. Adoption metrics are visible. Business outcomes are messier. You lived this with Agile. Sprint reviews became slide decks instead of real-time collaboration with customers and stakeholders on working software. Teams spent months configuring Jira while their actual workflow problems went unaddressed. Velocity charts showed steady progress while customers saw no change in delivery speed or quality. (As it turns out, velocity is the easiest metric to manipulate.) The costume changed. The performance didn’t.

Why Organizations Perform Change Instead of Changing

There is a reason organizations default to theater: it is safer than transformation. Real transformation requires admitting current approaches aren’t working. It requires changing power structures, incentives, and decision rights. It requires accepting failure as learning. Most importantly, it requires patience for messy, non-linear progress. Performance is cleaner. You can schedule it, script it, and control it. You can show steadily improving metrics (even if those metrics are meaningless). You can claim success based on activity rather than outcomes. You can avoid the uncomfortable questions about why things aren’t actually changing. Consider the “Definition of Done” parallel. In Scrum, teams under pressure often erode their Definition of Done, ultimately defining “done” as “demo-able at the Sprint Review” rather than “deployable.” (And even when the latter happens, it often ships with a “known issues will be fixed later” label.) With AI, “ready” means “the model passes offline tests” rather than “the AI workflow produces safe, reproducible, valuable outcomes in production.” Both redefinitions serve the same purpose: they make it easier to claim success without achieving it.
The Four Acts of Transformation Theater

Act One: Tools to the Rescue!

Before identifying any specific problems, the organization evaluates platforms. Months of vendor comparisons, proof-of-concepts, and feature matrices. This feels like progress: there are meetings, decisions, and purchase orders. But it’s the same error as believing Jira configuration would create agility. Tools don’t transform organizations. Solving real problems transforms organizations.

Act Two: The Crack Pilot Team

A team working in isolation, with special resources and exemptions from standard constraints, achieves remarkable results. Their AI system processes documents 100x faster! Of course, it only works with their test data, bypasses security protocols, and ignores regulatory requirements. But those details don’t make it into the success story; just like those high-performing Agile teams whose success couldn’t survive their first contact with enterprise governance.

Act Three: All Metrics Show Green

Dashboards everywhere tracking adoption, usage, and engagement. No tracking of problems solved, time saved, or value delivered. It’s velocity theater all over again: measuring activity because it’s easier than measuring outcomes. Teams learn quickly: generate the metrics that make management happy, regardless of actual impact.

Act Four: The Integration Collapse

Reality arrives; what reigns supreme in the sandbox fails in practice: The AI solutions can’t access relevant production data. They don’t comply with regulations. They can’t handle real-world edge cases. The governance board won’t approve them. The run costs are unsustainable. The organization quietly abandons or drastically scales back the initiative, though the theater continues.

Reading the Signs

You don’t need AI expertise to spot these patterns. You need the pattern recognition you have already developed:

When someone says “We’re using AI for…” but can’t show you the actual workflow in production, you’re watching AI theater.
When problem statements trail tool decisions: “We bought OpenAI licenses, now what should we do with them?”—you’re watching the tool quest.
When success stories come only from isolated teams with special conditions, you’re watching the miracle pilot.
When dashboards track activity but no one can articulate business impact, you’re watching the metrics mirage.

The AI Transformation Alternative Path

Some organizations do get this right. They start with specific, measurable problems. They run short experiments with clear hypotheses. They measure outcomes, not activity. They integrate with existing systems from day one. They define explicit quality bars that include safety, legality, and operational readiness. This approach isn’t revolutionary. It is the empirical approach Agile was supposed to enable. The difference is in doing it instead of performing it. You are probably reading this because you recognize the patterns. You have seen this movie before. You know how it ends. The question is what you will do about it. You could play along with the theater because it is politically safer. You could become cynical and disengage. Or you could use your pattern recognition to ask the uncomfortable questions, insist on real problems before tool purchases, demand integration plans before pilots, and measure outcomes instead of activity. (Sound familiar?) You’re not being negative when you point out these patterns. You’re being empirical. The most valuable person in your AI transformation isn’t the prompt engineer.
It is the practitioner who can say, “We’ve seen this pattern before, and here’s how it ends.” That’s you.

AI Transformation Failure — Conclusion

You have read this far because you remember “last time” and you do not believe in “this time will be different.” Maybe it was the tool evaluation committees that never asked, “What problem are we solving?” Perhaps it was the pilot teams freed from every organizational constraint, declaring victory. Maybe it was the dashboards tracking everything except value delivered. Whatever it was, you recognized it. Not because you are an AI expert, but because you have lived through this before with Agile. That recognition is your competitive advantage. While others are mesmerized by the demos and the metrics and the promises, you can see the patterns. You know that “87% AI adoption” is the new “200% velocity improvement”: meaningless numbers that hide an absence of real change. The patterns documented here aren’t warnings about what might happen. They are descriptions of what’s happening right now. The question isn’t whether your organization is performing these patterns. It is how many of them you are watching simultaneously. But here is what is different this time: you are different. You have developed immunity to transformation theater through exposure. You can spot the difference between a demo and a deployment. You know what happens to pilots when they hit production constraints. You have seen metrics games before. This gives you a choice. Not between supporting or opposing AI; that is a false binary. The choice is between performing a transformation (theater) and actually transforming. Between measuring adoption and measuring impact. Between isolated success and integrated delivery. The hardest part isn’t technical. It is political. Asking “what specific problem does this solve?” in a room excited about possibilities. Insisting on production constraints when everyone wants to see potential. Measuring outcomes when activities are easier to track. These aren’t acts of resistance. They are acts of professionalism. You are not the skeptic who shoots down innovation. You are the practitioner who knows the difference between motion and progress. The transformation your organization needs isn’t about AI. It is about finally learning the lessons from every transformation that came before. The patterns won’t change until organizations stop rewarding theater over reality, activity over outcomes, and tools over problems. You can’t change the whole system. But you can change your part of it. One real problem solved. One honest metric. One integrated solution. One voice in the meeting asking: “How is this different from when we did this with Agile?” That voice matters more than you think. Because someone needs to say what everyone who survived the last transformation is thinking: we’ve seen this movie before, and we know how it ends. Unless, this time, we change the script.